Module 3: Review of Basic Data Analytic 
Methods Using R 

Upon completion of this module, you should be able to: 

• Use basic analytics methods such as distributions, 
statistical tests and summary operations to 
investigate a data set. 

• Use R as a tool to perform basic data analytics, 
reporting and basic data visualization. 

Putting the Data Analytics Lifecycle into Practice 


• From Module 2 you learned a strategy to approach any data 
analytics problem: 

• Phase 1: Discovery 

• Phase 2: Data Preparation 

• Phase 3: Model Planning (covered in Module 4) 

• Phase 4: Model Building 

• Phase 5: Communicate Results 

• Phase 6: Operationalize 

• To begin to analyze the data you need: 

► 1. A tool that allows you to look at the data - that is "R". 

► 2. Skill in basic statistics - we're providing a refresher. 

Module 3: Review of Basic Data Analytic Methods 
Using R 

Part 1: Using R to Look at Data - Introduction to R 


During this lesson the following topics are covered: 

• Using the R Graphical User Interface 

• Overview: Getting Data into (and out of) R 

• Data Types Used in R 

• Basic R Operations 

• Basic Statistics 

• Generic Functions 



GETTING A HANDLE 
ON THE DATA 

Five Things to Remember About R 

1. (Almost) everything is an object 

2. (Almost) everything is a vector 

► Example: x <- 3 (x is a vector of length 1) 

v <- c(2,4,6,8,10) (v is a vector of length 5) 

3. All commands are functions 

► Example: quit() or q(), not q 

4. Some commands produce different output depending... 

5. Know your default arguments! 
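
The five points above can be tried directly at the console. This is a minimal sketch in base R; the variable names are illustrative, and the reading of point 4 (output depends on the class of the object) is an assumption, since the slide text is cut off.

```r
# 1 and 2: even a single number is an object, and it is a vector of length 1
x <- 3
v <- c(2, 4, 6, 8, 10)
length(x)   # 1
length(v)   # 5

# 3: commands are functions, so they are called with parentheses;
# typing the bare name prints the function definition instead of running it
is.function(q)   # TRUE

# 4 (assumed reading): the same function can behave differently per class
summary(v)                          # numeric summary
summary(factor(c("a", "b", "a")))   # counts per level

# 5: defaults matter; mean() keeps missing values unless told otherwise
w <- c(1, 2, NA)
mean(w)                 # NA
mean(w, na.rm = TRUE)   # 1.5
```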

Using the RStudio Graphical User Interface 

Overview: Getting Data Into (and Out of) R 


• Getting Data Into R 

► Type it in (if it's small)! 

► Read from a data file 

► Read from a database 

• Getting Data Out of R 

► Save in a workspace 

► Write a text file 

► Save an object to the file system 

► You can save plots as well! 

Typing Data Into R 


[Screenshot: an RStudio session. The script pane contains the two lines below; the console shows ?edit, fix(v), and edit(v) being run, and the Help pane shows the documentation for edit().]

v = c(1:10) 
edit(v) 

If edit() fails, the console suggests recovering with a command like x <- edit(). 
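
As a hedged sketch of typing small data in by hand: the data frame and its column names below are made up for illustration, and the interactive editor is only opened when a user is present.

```r
# Small data sets can simply be typed in with c()
v <- c(1:10)

# edit() opens an interactive editor, so only call it in an interactive session
if (interactive()) {
  v <- edit(v)   # edit() returns the modified copy, so assign the result
}

# A small data frame can be typed in column by column (illustrative values)
scores <- data.frame(name = c("Ann", "Bob", "Cy"), score = c(90, 85, 77))
print(scores)
```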

Getting Data Into R: External Sources 

• R supports multiple file formats 

► read.table() is the main function 

• File name can be a URL 

► read.table("http://ahost/file.csv", sep=",", header=TRUE) is nearly the 
same as read.csv(...) 

• Can read directly from a database via ODBC interface (RODBC package) 

► mydb <- odbcConnect("MyPostgresDB", ...) 

• R packages exist to read data from Hadoop or HDFS (more later) 


Note! Use the forward-slash ("/") character in full file names, even on Windows; a backslash must be doubled ("\\") 

"C:/users/janedoe/My Documents/Script.R" 
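
To illustrate how read.table() and read.csv() relate, the sketch below writes a tiny CSV to a temporary file and reads it both ways. The file contents are made up for illustration.

```r
# Write a small CSV to a temporary file so the example is self-contained
path <- tempfile(fileext = ".csv")
writeLines(c("name,value", "a,1", "b,2"), path)

# read.table() needs the separator and header spelled out ...
a <- read.table(path, sep = ",", header = TRUE)
# ... while read.csv() sets them by default
b <- read.csv(path)

all.equal(a, b)   # the two results agree for this simple file
str(b)
```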

Getting Data Out of R 


Options                              R Code

Save it as part of your workspace    save.image(file="dfm.Rdata")
(or a different workspace)           save.image()    # a .RData file
                                     load("dfm.Rdata")

Save it as a data file               write.csv(dfm, file="dfm.csv")

Save it as an R object               save(Mydata, file="Mydata.Rdata")
                                     load(file="Mydata.Rdata")

Plots can be saved as images         savePlot(filename="filename.ext",
                                              type="type")
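
A minimal round trip through the options in the table, written to the session's temporary directory. The data frame is illustrative. Note one substitution: savePlot() copies an on-screen plot, so this sketch saves the plot by opening a pdf() file device instead, which also works without a display.

```r
dfm <- data.frame(id = 1:3, value = c(2.5, 3.5, 4.5))   # illustrative data
out_dir <- tempdir()

# Save as a text data file
csv_path <- file.path(out_dir, "dfm.csv")
write.csv(dfm, file = csv_path, row.names = FALSE)

# Save a single R object, remove it, and load it back under the same name
rdata_path <- file.path(out_dir, "dfm.Rdata")
save(dfm, file = rdata_path)
rm(dfm)
load(rdata_path)

# Save a plot: open a file device, draw, then close the device
pdf_path <- file.path(out_dir, "dfm.pdf")
pdf(pdf_path)
plot(dfm$id, dfm$value)
dev.off()

file.exists(c(csv_path, rdata_path, pdf_path))
```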

Data Classification: A Quick Review 


Data "NOIR"    Examples

Nominal        condo, house, rental
Ordinal        hates < dislikes < neutral < likes < loves
Interval       10°F colder tomorrow than today
Ratio          5342 > 4321


Some statistical tests require data at the interval level or higher. 
Other tests assume ordinal or nominal. Make sure you check. 
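
In R, the nominal and ordinal levels map naturally onto unordered and ordered factors. The sketch below reuses the example values from the table; the variable names are illustrative.

```r
# Nominal: categories with no order
housing <- factor(c("condo", "house", "rental", "house"))
table(housing)

# Ordinal: categories with an order, so comparisons are meaningful
opinion_levels <- c("hates", "dislikes", "neutral", "likes", "loves")
opinion <- factor(c("likes", "hates", "loves"),
                  levels = opinion_levels, ordered = TRUE)
opinion < "neutral"    # FALSE  TRUE FALSE
max(opinion)           # loves

# Interval and ratio data are stored as ordinary numeric vectors
amounts <- c(5342, 4321)
amounts[1] / amounts[2]   # a ratio is meaningful for ratio-level data
```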

Data Types Used in R 


Data Types           R Code

Numbers, Strings     n <- 3
                     s <- "Columbus, Ohio"

Vectors              levels <- c("Wow", "Good", "Bad")
                     ratings <- c("Bad", "Bad", "Wow")

Factors and Lists    f <- factor(ratings, levels)
                     l <- list(ratings=ratings,
                               critics=c("Siskel", "Ebert"))

Functions            stdev <- function(x) { sd(x) }
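
Running the table's code as one script shows what each type looks like in practice. This is a sketch; the inspection calls (table(), the list lookup, the call to stdev()) are additions for illustration.

```r
n <- 3                        # number
s <- "Columbus, Ohio"         # string (a character vector of length 1)

levels  <- c("Wow", "Good", "Bad")
ratings <- c("Bad", "Bad", "Wow")

f <- factor(ratings, levels)  # factor: categorical data with a fixed set of levels
table(f)                      # counts per level, including the unused "Good"

l <- list(ratings = ratings, critics = c("Siskel", "Ebert"))
l$critics[2]                  # "Ebert"

stdev <- function(x) { sd(x) }   # a user-defined function wrapping sd()
stdev(c(2, 4, 6))                # 2
```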

R Structured Types 


Data Types                     R Code

Matrix - n*m numeric array     m <- matrix(c(1:3, 11:13), nrow = 2, ncol = 3,
                                           byrow = TRUE)

Table - contingency table      t <- table(dfm$factor_variable)

Data frames - data sets        dfm <- read.csv("CrimeRatesByStates2005.csv")

Extracting data                xdfm <- dfm[1:3, ]
                               ydfm <- dfm[, 3:5]
                               s <- dfm$state
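
The same structures can be explored without the CSV file. In this sketch the built-in mtcars data frame stands in for the crime-rate data (an assumption made so the code is self-contained), and the column choices are illustrative.

```r
m <- matrix(c(1:3, 11:13), nrow = 2, ncol = 3, byrow = TRUE)
m            # row 1 holds 1 2 3, row 2 holds 11 12 13
dim(m)       # 2 3

dfm <- mtcars                 # built-in data frame standing in for the CSV
t <- table(dfm$cyl)           # contingency table: cars per cylinder count

xdfm <- dfm[1:3, ]            # first three rows, all columns
ydfm <- dfm[, 3:5]            # all rows, columns 3 to 5
s    <- dfm$mpg               # one column as a vector
```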

Basic R Operations on Vectors 


Function                     R Code

Operations on Vectors        v <- c(1:10); w <- c(15:24); nv <- v * pi; nw <- w * v

Vector transformations       radius <- sqrt(d$area / pi)
                             t <- table(dfm$factor_variable)
                             pct <- t / sum(t) * 100

Logical Vectors              v[v < 1000]
                             ndf <- subset(dfm, population < 10000)
                             nv <- v[c(1,2,3,5,8,13)]

Examining data structures    dim(dfm); attributes(dfm); class(dfm); typeof(dfm)
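
A runnable version of these operations, as a sketch: the built-in mtcars data frame and the small area vector are stand-ins for the slide's dfm and d objects, which are not available here.

```r
v <- c(1:10); w <- c(15:24)
nv <- v * pi          # the scalar is recycled across the vector
nw <- w * v           # element-wise product of two equal-length vectors

area   <- c(10, 20, 30)           # illustrative areas
radius <- sqrt(area / pi)         # radius of a circle with the given area

cyl_counts <- table(mtcars$cyl)
pct <- cyl_counts / sum(cyl_counts) * 100   # percentage in each category

v[v < 5]                              # logical indexing: 1 2 3 4
ndf <- subset(mtcars, mpg < 15)       # rows meeting a condition
v[c(1, 2, 3, 5, 8)]                   # positional indexing: 1 2 3 5 8

dim(mtcars); class(mtcars); typeof(mtcars)
```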

Descriptive Statistics 


Function                      R Code

View the data                 head(x); tail(x)

View a summary of the data    summary(x)

Compute basic statistics      sd(x); var(x); range(x); IQR(x)

Correlation                   cor(x); cor(d$var1, d$var2)
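
These functions can be tried on any numeric data. The sketch below uses the built-in mtcars data frame as a stand-in; the choice of columns is illustrative.

```r
x <- mtcars$mpg                 # built-in data, used here as a stand-in
head(x); tail(x)
summary(x)
sd(x); var(x); range(x); IQR(x)

# cor() on a data frame or matrix gives the full correlation matrix
cor(mtcars[, c("mpg", "wt", "hp")])
# cor() on two vectors gives a single coefficient
cor(mtcars$mpg, mtcars$wt)
```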

Generic Functions 


• Also known as method overriding in OO-land 

• Specific actions that differ based on the class of the object: 


Function                   Code

Plot the variable x        plot(x)

Histogram of x             hist(x)

Internal structure of x    str(x)


• Good for initial data exploration (more later) 
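
Dispatch on class can be seen with summary(), and a new method can be added for a class of your own. This is a sketch of R's S3 mechanism; the class name "rating" and its field are hypothetical, invented for the example.

```r
x_num <- c(3, 1, 4, 1, 5)
x_fac <- factor(c("a", "b", "a"))

summary(x_num)   # numeric method: min, quartiles, mean, max
summary(x_fac)   # factor method: counts per level
str(x_num); str(x_fac)

# Defining a print method for a new (hypothetical) class "rating"
r <- structure(list(stars = 4), class = "rating")
print.rating <- function(x, ...) {
  cat("Rating:", x$stars, "stars\n")
  invisible(x)
}
print(r)   # dispatches to print.rating
```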



Copyright © 2014 EMC Corporation. All Rights Reserved. Module 3: Basic Data Analytic Methods Using R 18 



Check Your Knowledge 


• Which data structures in R are the most used? Why? 

• Consider the cbind() function and the rbind() function, which bind 
a vector to a data frame as a new column or a new row. When 
might these functions be useful? 
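
For reference when answering the second question, a minimal sketch of both functions on a made-up data frame:

```r
df <- data.frame(id = 1:3, score = c(90, 85, 77))   # illustrative data

# cbind(): add a derived variable as a new column
df2 <- cbind(df, passed = df$score >= 80)

# rbind(): append a new observation as a new row
df3 <- rbind(df, data.frame(id = 4, score = 88))

dim(df2)   # 3 3
dim(df3)   # 4 2
```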

Module 3: Review of Basic Data Analytic Methods 
Using R 

Part 1: Summary 

During this lesson the following topics were covered: 

• How to use the R Graphical User Interface 

• How to get data into (and out of) R 

• Data Types used in R, and the basic R operations 

• Basic descriptive statistics 

• Using generic functions 

Lab Exercise 2: Introduction to R 



This lab is designed to investigate and practice working 
with R and using it to examine data. 

• After completing the tasks in this lab you should be able to: 

• Read data sets into R, save them, and examine the 
contents 





Lab Exercise 2: Introduction to R 



• Invoke the R environment 

• Examine the Workspace 

• Getting Familiar with R 

• Read in the Lab Script 

• Working with R: reading external data 

• Verify the contents of the tables 

• Manipulating data frames in R 

• Investigate your data 

• Save the data sets 

• Continue investigating the data 

• Exit R 

Module 3: Review of Basic Data Analytic Methods 
Using R 

Part 2: Analyzing and Exploring the Data 


During this lesson the following topics are covered: 

• Why visualize? 

• Examining a single variable 

• Examining pairs of variables 

• Indications of dirty data. 

• Data exploration vs. presentation 

Why Visualize? 


Summary statistics give us some sense of the data: 

► Mean vs. Median. 

► Standard deviation. 

► Quartiles, Min/Max. 

► Correlations between variables. 


summary(data) 

       X                   Y
 Min.   :-3.05439    Min.   :-3.50179
 1st Qu.:-0.61055    1st Qu.:-0.75968
 Median : 0.04666    Median : 0.07340
 Mean   :-0.01105    Mean   : 0.09383
 3rd Qu.: 0.56067    3rd Qu.: 0.88114
 Max.   : 2.60614    Max.   : 4.28693

[Figure: plot of the data, x axis from -3 to 2]


Visualization gives us 
a more holistic sense 
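
The contrast between a numeric summary and a picture can be reproduced with simulated data. In this sketch the data, the seed, and the relationship between x and y are all assumptions made for illustration; they are not the data behind the slide.

```r
set.seed(1)                       # arbitrary seed for reproducibility
data <- data.frame(x = rnorm(1000))
data$y <- data$x + rnorm(1000)    # y depends on x plus noise (assumed)

summary(data)                     # numbers only
cor(data$x, data$y)

pdf(NULL)                         # null device so this runs without a display;
                                  # omit pdf(NULL) and dev.off() in RStudio
plot(data$x, data$y, xlab = "x", ylab = "y")   # the picture shows the shape
dev.off()
```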

Anscombe's Quartet 

4 data sets, characterized by the following. 

Property                                      Values

Mean of x in each case                        9 (exact)
Variance of x in each case                    11 (exact)
Mean of y in each case                        7.50 (to 2 d.p.)
Variance of y in each case                    4.12 to 4.13 (to 2 d.p.)
Correlation between x and y in each case      0.816 (to 3 d.p.)
Linear regression line in each case           y = 3.00 + 0.500x
                                              (to 2 d.p. and 3 d.p. resp.)
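
The properties listed above can be checked directly, because R ships the quartet as the built-in data frame anscombe (columns x1 to x4 and y1 to y4). A minimal sketch:

```r
# Compute the summary statistics for each of the four data sets
stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- coef(lm(y ~ x))          # intercept and slope of the regression line
  c(mean_x = mean(x), var_x = var(x),
    mean_y = mean(y), var_y = var(y),
    cor_xy = cor(x, y),
    intercept = unname(fit[1]), slope = unname(fit[2]))
})
colnames(stats) <- paste("set", 1:4)
round(stats, 3)   # one column per data set; the columns are nearly identical
```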

Moral: Visualize Before Analyzing! 




[Figure: the four data sets plotted as y against x, one panel each for x1, x2, x3, and x4]

Visualizing Your Data 


• Examining the distribution of a single variable 
• Analyzing the relationship between two variables 
• Establishing multiple pairwise relationships between variables 
• Analyzing a single variable over time 
• Data exploration versus data presentation 

Examining the Distribution of a Single Variable 


Graphing a single variable 

• plot(sort(.)) - for low volume data 

• hist(.) - a histogram 

• plot(density(.)) - density plot 

► A "continuous histogram" 

• Example 

► Frequency table of household income 

►► rug() plot emphasizes distribution 

[Figure: histogram of household income]

[Figure: density plot with rug, "Distribution log10(HouseHold Income)", N = 120529, Bandwidth = 0.03302]



What are we looking for? 

A sense of the data range 

• If it's very wide, or very skewed, try computing the log 

Outliers, anomalies 

• Possibly evidence of dirty data 

Shape of the Distribution 

• Unimodal? Bimodal? 

• Skewed to left or right? 

• Approximately normal? Approximately lognormal? 

Example - Distribution of purchase size ($) 

• Range from 0 to > $10K, right skewed 

• Typical of monetary data 

• Plotting log of data gives better sense of distribution 

• Two purchasing distributions 

► ~ $55 

► ~ $2900 

[Figure: density of purchase size (dollars), x axis from 0 to 10000]

[Figure: density of the log of purchase size]
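
The single-variable plots discussed above can be sketched on simulated data. Everything about the data here is an assumption for illustration: purchase sizes are drawn from two lognormal groups centred near $55 and $2900 to mimic the example, with arbitrary spreads and group sizes.

```r
set.seed(42)                                   # arbitrary seed
# Simulated purchase sizes: many small purchases and a few large ones
purchase <- c(rlnorm(900, meanlog = log(55),   sdlog = 0.5),
              rlnorm(100, meanlog = log(2900), sdlog = 0.3))

pdf(NULL)                                      # null device; omit in RStudio
hist(purchase, main = "Raw scale: right skewed")
d <- density(log10(purchase))                  # a "continuous histogram"
plot(d, main = "log10 scale: two modes")
rug(log10(purchase))                           # tick marks for each observation
dev.off()

range(purchase)
```

On the raw scale the small purchases are squeezed against the left edge; on the log10 scale the two groups appear as separate modes.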

Evidence of Dirty Data 


[Figure: accountholder age distribution (x axis: age)]

Annotations on the plot: 

• Missing values? 

• Mis-entered data? 

• Inherited accounts? 



"Saturated" Data 


[Figure: "Portfolio Distribution, Years since origination" (x axis: Mortgage Age, 0 to 10)]

Do we really have no mortgages older than 10 years? 

Or does the year 2004 in the origination field mean "2004 or prior"? 
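
Alongside the plots, a few quick counts can flag the same problems. This is a sketch on made-up values; the thresholds (ages under 18 or over 100) are illustrative assumptions, not rules from the course.

```r
# Simulated accountholder ages with some deliberately "dirty" entries
age <- c(34, 51, 0, 0, 0, 27, 115, 62, NA, 45, 2, 38)

sum(is.na(age))                    # explicitly missing values
sum(age == 0, na.rm = TRUE)        # zeros may be a placeholder for "unknown"
age[which(age > 100 | age < 18)]   # implausible or unusual ages to inspect

# Saturation: a spike at the edge of the range suggests a capped value
mortgage_age <- c(1, 3, 4, 6, 7, 10, 10, 10, 10, 10)
table(mortgage_age)
mean(mortgage_age == max(mortgage_age))   # share of records at the maximum
```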

Analyzing the Relationship Between Two Variables 


How? 

• Two Continuous Variables (or two discrete variables) 

► Scatterplots 

► LOESS (fit smoothed line to the data) 

► Linear models: graph the correlation 

► Binplots, hexbin plots 

►► More legible color-based plots for high volume data 

• Continuous vs. Discrete Variable 

► Jitter, Box and whisker plots, Dotplot or barchart 

Example: 

• Household income by region (ZIP1) 

• Scatterplot with jitter, with box-and-whisker overlaid 

• New England (0) and West Coast (9) have highest mean household income 

[Figure: household income by region (x axis: Zip1)]
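
A jittered scatterplot with a box-and-whisker overlay, as in the example, can be sketched in base R. The household income data is not available here, so the built-in iris data stands in: Species plays the role of the discrete variable and Sepal.Length the continuous one.

```r
pdf(NULL)                                      # null device; omit in RStudio
# Box-and-whisker plot of a continuous variable per level of a discrete one
boxplot(Sepal.Length ~ Species, data = iris, outline = FALSE)
# Overlay the raw points, jittered sideways to reduce overplotting
stripchart(Sepal.Length ~ Species, data = iris, vertical = TRUE,
           method = "jitter", pch = 1, add = TRUE)
dev.off()

group_means <- tapply(iris$Sepal.Length, iris$Species, mean)
group_means                                    # the group means behind the picture
```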

Two Variables: What are we looking for? 


• Is there a relationship 
between the two variables? 

► Linear? Quadratic? 

► Exponential? 

►► Try semi-log or log-log plots 

► Is it a cloud? 

►► Round? Concentrated? Multiple 
Clusters? 

• How? 

► Scatterplots 

• Example 

► Red line: linear fit 

► Blue line: LOESS 

► Fairly linear relationship, but with wide 
variance 
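
A scatterplot with a red linear fit and a blue LOESS curve, like the one in the example, can be drawn as follows. This sketch uses the built-in cars data as a stand-in for the slide's data.

```r
pdf(NULL)                                      # null device; omit in RStudio
plot(cars$speed, cars$dist, xlab = "speed", ylab = "dist")

fit <- lm(dist ~ speed, data = cars)
abline(fit, col = "red")                       # red line: linear fit

lo  <- loess(dist ~ speed, data = cars)
ord <- order(cars$speed)                       # draw the curve left to right
lines(cars$speed[ord], predict(lo)[ord], col = "blue")   # blue line: LOESS
dev.off()

coef(fit)
```

Where the two lines stay close together, a linear model is a reasonable description; where the LOESS curve bends away, the relationship is not linear.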


Two Variables: High Volume Data - Plotting 



[Figure: scatterplot of log10(MeanHouseholdIncome) against MeanEducation (x axis from 0 to 15)]

Scatterplot: 

Overplotting makes it difficult to see structure 

[Figure: hexbin plot of the same two variables, with a legend of counts per bin ranging from 132 to 7188]

Hexbinplot: 

Now we see where the data is concentrated. 
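
The idea behind binned plots is to count points per cell and colour each cell by its count. The slide's plot uses hexagonal bins; the sketch below is a base-R stand-in with square bins, on simulated data (the variables, grid size, and seed are assumptions for illustration).

```r
set.seed(7)                                  # arbitrary seed
n <- 50000
x <- rnorm(n, mean = 12, sd = 2)             # simulated stand-ins for the
y <- 4 + 0.05 * x + rnorm(n, sd = 0.2)       # two variables on the slide

# Count the points falling into each cell of a 40 x 40 grid
xb <- cut(x, breaks = 40)
yb <- cut(y, breaks = 40)
counts <- table(xb, yb)

pdf(NULL)                                    # null device; omit in RStudio
image(unclass(counts), xlab = "x bin", ylab = "y bin")   # colour encodes count
dev.off()

max(counts)                                  # size of the most crowded cell
```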

Establishing Multiple Pairwise Relationships 
Between Variables 


• Why? 

► Examine many two-way relationships quickly 

• How? 

► pairs(ds) can generate a plot of each pair of variables 

• Example 

► Iris Characteristics 

►► Strong linear relationship between petal length and width 

►► Petal dimensions discriminate species more strongly than sepal dimensions 

[Figure: scatterplot matrix, "Anderson's Iris Data - 3 species", with panels Sepal.Length, Sepal.Width, Petal.Length, Petal.Width]
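
The scatterplot matrix can be reproduced with the built-in iris data. A minimal sketch; the colour mapping (one colour per species) is an assumption about how the original figure was drawn.

```r
pdf(NULL)                                      # null device; omit in RStudio
# One scatterplot per pair of the four measurements, coloured by species
pairs(iris[, 1:4], col = as.integer(iris$Species),
      main = "Anderson's Iris Data - 3 species")
dev.off()

round(cor(iris[, 1:4]), 2)   # the same pairwise relationships as numbers
```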

Analyzing a Single Variable over Time 


What? 

• Looking for ... 

► Data range 

► Trends 

► Seasonality 

How? 

• Use time series plot 

Example 

• International air travel (1949-1960) 

• Upward trend: growth appears superlinear 

• Seasonality 

► Peak air travel around July and August, with a smaller 
peak near March 
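
R's built-in AirPassengers series covers monthly international airline passengers for 1949 to 1960, so it can serve as a stand-in for the example (that it is the same data as the slide's figure is an assumption). The sketch plots the series, then summarizes trend and seasonality so the peaks can be read off directly.

```r
pdf(NULL)                                      # null device; omit in RStudio
plot(AirPassengers, ylab = "passengers (thousands)")   # time series plot
dev.off()

# Trend: total passengers per year
yearly <- aggregate(AirPassengers, nfrequency = 1, FUN = sum)
yearly

# Seasonality: average by calendar month (1 = January, ..., 12 = December)
monthly <- tapply(AirPassengers, cycle(AirPassengers), mean)
round(monthly)
```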


Data Exploration vs. Presentation 


[Figure: "Distribution of account values (log10 scale)", a density plot]

Data Exploration: 

This tells you what you need to know. 

[Figure: "Distribution of account values", a bar chart of number of accounts by account value (dollars) in the ranges <1K, 1-5K, 5-10K, 10-50K, 50-100K, 100-500K, 500K-1M, >1M]

Presentation: 

This tells the stakeholders what they need to know. 
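
Both views can be built from the same variable. In this sketch the account values are simulated (the distribution and seed are assumptions); the bin boundaries follow the ranges shown on the presentation chart.

```r
set.seed(3)                                    # arbitrary seed
value <- rlnorm(5000, meanlog = log(20000), sdlog = 1.5)   # simulated values

pdf(NULL)                                      # null device; omit in RStudio
# Exploration: a density on the log10 scale shows the full shape
plot(density(log10(value)),
     main = "Distribution of account values (log10 scale)")

# Presentation: counts in ranges that stakeholders recognize
breaks <- c(0, 1e3, 5e3, 1e4, 5e4, 1e5, 5e5, 1e6, Inf)
labels <- c("<1K", "1-5K", "5-10K", "10-50K", "50-100K",
            "100-500K", "500K-1M", ">1M")
bins <- cut(value, breaks = breaks, labels = labels)
barplot(table(bins), xlab = "account value (dollars)",
        ylab = "number of accounts")
dev.off()

table(bins)
```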

Check Your Knowledge 


• Do you think the regression line sufficiently captures the 
relationship between the two variables? What might you do 
differently? 


• In the Iris slide example, how would you characterize the 
relationship between sepal width and sepal length? 


• Did you notice the use of color in the Iris slide? Was it effective? 
Why or why not? 

Module 3: Review of Basic Data Analytic Methods 
Using R 

Part 2: Summary 


During this lesson the following topics were covered: 

• Justifying why we visualize data 

• Using plots and graphs to determine: 

• Shape of a single variable 

• "dirty" data or "saturated" data 

• Relationship between two or more variables 

• Relationship between multiple variables 

• A single variable over time 

• Data exploration versus Presentation 